Overview and Assumptions

What is the optimal price point per targeted group, and how is it determined? To answer this question, a simulation will be built using distributions and parameters estimated from a Kaggle dataset and from findings in published journals.

Variables:

How is N, the number of surveys, determined? (This feeds the later simulation.) N is the required sample size.

  1. How many people are in the population?
  2. How representative do the survey results need to be? (Margin of Error, Confidence Level, Power Calculation)
  3. How many I# of Impressions will be needed to achieve C# of Clicks, and how many of those clicks will result in N# of Approved Conversions?
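Item 2 can be made concrete with the standard sample-size formula for a proportion; a minimal sketch, where the confidence level, margin of error, and p = 0.5 are illustrative defaults rather than project requirements:

```r
# Sample size for estimating a proportion:
# n = z^2 * p * (1 - p) / MoE^2  (defaults here are illustrative, not project-specified)
sample_size <- function(moe = 0.05, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)      # two-sided critical value
  ceiling(z^2 * p * (1 - p) / moe^2)  # round up to a whole respondent
}
sample_size()            # 95% confidence, +/-5% margin: 385
sample_size(moe = 0.03)  # a tighter margin needs more respondents: 1068
```

p = 0.5 is the conservative choice, since it maximizes p(1 - p) and therefore the required n.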

N# of Approved Conversions = (Approved Conversions / Clicks)_g x (Clicks / Impressions)_g x I# of Impressions, for each Interest Group g

Expected Cost for the client = N# of Approved Conversions + Cost of Ads + (Time x Incentive)

N# of Approved Conversions needed = N# of Approved Conversions obtained x Time x Incentive (Bayesian Inference)

Rate of Approved Conversion = AP / (Time x Incentive)
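The impression-to-conversion funnel above can be sketched numerically; the rates below are illustrative placeholders, not estimates from the Kaggle data:

```r
# Expected approved conversions from a planned number of impressions,
# using group-level click-through and approval rates (illustrative values).
expected_conversions <- function(impressions, ctr, approval_rate) {
  clicks <- impressions * ctr  # expected clicks for the group
  clicks * approval_rate       # expected approved conversions
}
# e.g. 10,000 impressions, 2% CTR, 10% of clicks approved:
expected_conversions(10000, 0.02, 0.10)  # 20
```

Run per interest group, this gives the I# of Impressions needed to reach the target N from item 3.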

Kaggle Facebook Data

Characteristics from this dataset will be used in parameterizing the model.

Downloaded from https://www.kaggle.com/loveall/clicks-conversion-tracking/version/1

The data used in this project is from an anonymous organisation’s social media ad campaign. The file conversion_data.csv contains 1143 observations of 11 variables. Below are descriptions of the variables.

  1. ad_id: a unique ID for each ad.
  2. xyz_campaign_id: an ID associated with each ad campaign of XYZ company.
  3. fb_campaign_id: an ID associated with how Facebook tracks each campaign.
  4. age: age of the person to whom the ad is shown.
  5. gender: gender of the person to whom the ad is shown.
  6. interest: a code specifying the category to which the person’s interest belongs (interests are as mentioned in the person’s Facebook public profile).
  7. Impressions: the number of times the ad was shown.
  8. Clicks: number of clicks for that ad.
  9. Spent: amount paid by company xyz to Facebook to show that ad.
  10. Total conversion: total number of people who enquired about the product after seeing the ad.
  11. Approved conversion: total number of people who bought the product after seeing the ad.

ad_id is labeled numerically but is nominal; it is just a name. The major gaps between consecutive id numbers suggest that much of the data is missing.

summary(KAG)
##      ad_id      xyz_campaign_id fb_campaign_id    age      gender 
##  708746 :   1   1: 54           420    :   6   30-34:426   F:551  
##  708749 :   1   2:464           433    :   6   35-39:248   M:592  
##  708771 :   1   3:625           467    :   6   40-44:210          
##  708815 :   1                   477    :   6   45-49:259          
##  708818 :   1                   502    :   6                      
##  708820 :   1                   538    :   6                      
##  (Other):1137                   (Other):1107                      
##     interest    Impressions         Clicks           Spent      
##  5      :140   Min.   :   1.0   Min.   :  1.00   Min.   :  1.0  
##  3      : 85   1st Qu.: 274.5   1st Qu.:  2.00   1st Qu.: 52.0  
##  17     : 77   Median : 559.0   Median :  9.00   Median :298.0  
##  15     : 60   Mean   : 560.7   Mean   : 30.05   Mean   :335.2  
##  4      : 51   3rd Qu.: 844.5   3rd Qu.: 38.50   3rd Qu.:583.5  
##  16     : 51   Max.   :1130.0   Max.   :183.00   Max.   :869.0  
##  (Other):679                                                    
##  Total_Conversion Approved_Conversion
##  Min.   : 1.000   Min.   : 1.000     
##  1st Qu.: 2.000   1st Qu.: 1.000     
##  Median : 2.000   Median : 2.000     
##  Mean   : 3.802   Mean   : 1.933     
##  3rd Qu.: 4.000   3rd Qu.: 2.000     
##  Max.   :32.000   Max.   :16.000     
## 
head(KAG)
## # A tibble: 6 x 11
##   ad_id xyz_campaign_id fb_campaign_id age   gender interest Impressions
##   <fct> <fct>           <fct>          <fct> <fct>  <fct>          <dbl>
## 1 7087~ 1               1              30-34 M      4                289
## 2 7087~ 1               2              30-34 M      5                420
## 3 7087~ 1               3              30-34 M      8                 50
## 4 7088~ 1               4              30-34 M      16               213
## 5 7088~ 1               4              30-34 M      16               212
## 6 7088~ 1               5              30-34 M      17               130
## # ... with 4 more variables: Clicks <dbl>, Spent <dbl>,
## #   Total_Conversion <dbl>, Approved_Conversion <dbl>
ecdf(as.numeric(as.character(KAG$ad_id))) %>% plot()

max(as.numeric(as.character(KAG$ad_id))) - min(as.numeric(as.character(KAG$ad_id)))
## [1] 605669

The graph shows that massive amounts of data are missing: only 1,143 observations appear across an ad_id range of 605,669. The data only include observations with at least 1 Click, 1 Total Conversion, and 1 Approved Conversion, so we have no way to estimate how often our predictor variables failed to get a click; comparing the range with what we have, we can only guess that this happened roughly 500 times more often.
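As a quick check on that order-of-magnitude guess, the ratio of the ad_id span to the observed row count (both numbers taken from the output above) can be computed directly:

```r
# Ratio of the ad_id span to the number of rows actually present.
id_range <- 605669   # max(ad_id) - min(ad_id), from the output above
observed <- 1143     # rows in conversion_data.csv
round(id_range / observed)  # about 530 unseen ids per observed row
```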

This project is not an advertising campaign, but the ad cost model will be based on cost per 1,000 impressions. The data appear to reflect some combination of cost per click AND cost per impression, since Spent is sometimes 1 when Clicks is 1 while Impressions can be any number. I believe the data are summarized, which makes building a regression model far more difficult.

How FB Ads work

Ads are ranked based on who they target; the highest bid does not necessarily win. Facebook adjusts bids based on our budget and the time left, and determines which ad wins the auction by how relevant it is to the person who will see it. This works the same way on Instagram, Messenger, and Audience Network. Ads are ranked by Total Value, a number calculated from multiple factors, including the bid, predictions of how the audience will react, and how relevant Facebook finds the ad. Account relevance and ad quality make the ad competitive:

[Advertiser Bid] x [Estimated Action Rates] + [User Value] = Total Value

[Advertiser Bid] = bid strategy, bid cap or cost target

[Estimated Action Rates] = the optimization event, based on past historical performance. Advertiser values are converted to an effective CPM (eCPM), the cost per 1,000 impressions: total amount spent divided by Impressions, times 1,000.

  • opt for impressions: eCPM = (advertiser bid per 1,000 impressions)
  • opt for clicks: eCPM = (advertiser bid per click) x (estimated click-through rate) x 1,000
  • opt for actions: eCPM = (advertiser bid per action) x (estimated click-through rate) x (estimated conversion rate) x 1,000

Ad type   Bid                         eCTR   eCVR   x per 1,000 impressions served   eCPM (dollars)
Ad 1      $10 per 1,000 impressions   -      -      x 1                              = 10
Ad 2      $1 per click                1%     -      x 1,000                          = 10
Ad 3      $100 per conversion         1%     1%     x 1,000                          = 10

eCTR = Clicks / Impressions

eCVR = Approved_Conversions / Clicks * 100
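The table rows can be reproduced with a small eCPM calculator; this is my own sketch of the formulas above (the `ecpm` function and its `type` labels are not part of any Facebook API):

```r
# Effective CPM under the three optimization modes described above.
ecpm <- function(bid, type = c("impressions", "clicks", "actions"),
                 ectr = NA, ecvr = NA) {
  type <- match.arg(type)
  switch(type,
         impressions = bid,                       # bid is already per 1,000 impressions
         clicks      = bid * ectr * 1000,         # bid per click x eCTR x 1,000
         actions     = bid * ectr * ecvr * 1000)  # bid per action x eCTR x eCVR x 1,000
}
ecpm(10, "impressions")                         # Ad 1: 10
ecpm(1, "clicks", ectr = 0.01)                  # Ad 2: 10
ecpm(100, "actions", ectr = 0.01, ecvr = 0.01)  # Ad 3: 10
```

All three ads tie at an eCPM of $10, which is exactly why the auction normalizes different bid types to a common per-1,000-impressions scale.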

Determining the random variables

Impressions and clicks are what will be used to predict spending, so these variables need to be randomized in our simulation.

rvariables <- KAG %>% 
  group_by(age, gender, interest) %>%
  summarize(mImp = mean(Impressions),
            vImp = sqrt(var(Impressions)),   # standard deviation of Impressions, despite the v prefix
            mClicks = mean(Clicks),
            vClicks = sqrt(var(Clicks)),     # standard deviation of Clicks
            CPM = round(mean(Spent/Impressions)*10,3))

ggplot(rvariables, aes(x=mImp)) +  
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(rvariables, aes(x=mClicks)) +  
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(rvariables, aes(x=CPM)) +  
  geom_histogram(colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Impressions appear to be roughly normally distributed. Clicks appear Poisson or negative binomially distributed. CPM looks like it could be normally distributed but skewed to the left. NOTE: remember that this data only contains ads that resulted in at least 1 click and 1 approved conversion, so it will heavily favor ads that were not seen often, since they also resulted in a conversion.

The number of Impressions is not entirely random. Impressions depend on Facebook’s relevance score; Facebook will only show the ad based on how well previous ads aligned with their viewers. The relevance score determines how many Impressions Facebook will allow the ad to have. This means that using one random number of impressions across different groups doesn’t work, since each group will have a different relevance score.
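One way to respect that is to give each targeting group its own impression distribution; a minimal sketch, with group names, means, and standard deviations invented for illustration rather than fitted to the Kaggle data:

```r
# Draw daily impressions per group from group-specific distributions,
# so each targeting group carries its own relevance-driven volume.
# All parameter values below are illustrative assumptions.
set.seed(1)
group_params <- data.frame(group = c("g1", "g2", "g3"),
                           mImp  = c(800, 560, 300),   # hypothetical group means
                           sdImp = c(250, 200, 150))   # hypothetical group sds
draw_impressions <- function(n, m, s) pmax(0, round(rnorm(n, m, s)))
imps <- Map(draw_impressions, 30, group_params$mImp, group_params$sdImp)
sapply(imps, mean)  # per-group average daily impressions differ by group
```

In the real model the per-group parameters would come from `rvariables` above rather than being hand-picked.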

Consider the following data from one specific group and observe how the Spent variable fluctuates. We can see a large variability in Impressions and the cost to reach this group.

KAG %>%
  select(age, gender, interest, Impressions, Clicks, Spent) %>%
  mutate(Cost_Click = Spent/Clicks,
         CTR = (Clicks/Impressions)*1000,
         bid = Spent/CTR,
         CPM = round((Spent/Impressions)*10,3)) %>%
  filter(age == "30-34",
         gender == "M",
         interest == "4")
## # A tibble: 15 x 10
##    age   gender interest Impressions Clicks Spent Cost_Click    CTR    bid
##    <fct> <fct>  <fct>          <dbl>  <dbl> <dbl>      <dbl>  <dbl>  <dbl>
##  1 30-34 M      4                289      2    47      23.5    6.92  6.79 
##  2 30-34 M      4                405      4   176      44      9.88 17.8  
##  3 30-34 M      4                392      4   140      35     10.2  13.7  
##  4 30-34 M      4                154      1     1       1      6.49  0.154
##  5 30-34 M      4                150      1     1       1      6.67  0.15 
##  6 30-34 M      4                281      1     1       1      3.56  0.281
##  7 30-34 M      4                508      7   275      39.3   13.8  20.0  
##  8 30-34 M      4                435      3   122      40.7    6.90 17.7  
##  9 30-34 M      4                232      1     1       1      4.31  0.232
## 10 30-34 M      4                116      1     1       1      8.62  0.116
## 11 30-34 M      4                496      2    75      37.5    4.03 18.6  
## 12 30-34 M      4               1117    110   805       7.32  98.5   8.17 
## 13 30-34 M      4               1130    179   869       4.85 158.    5.49 
## 14 30-34 M      4                495      2    64      32      4.04 15.8  
## 15 30-34 M      4                938     31   558      18     33.0  16.9  
## # ... with 1 more variable: CPM <dbl>

The Facebook Relevance Score

A rating from 1 to 10 that estimates how well your target audience is responding to your ad. This score is shown after your ad receives more than 500 impressions; the higher the score, the better. Relevance scores take about 3-7 days to appear, depending on your budget. The ad algorithm usually takes about 7 days to start its learning phase so the auction can work more effectively.

The Relevance Score measures the quality and engagement level of an ad. It determines the cost per click and how frequently Facebook shows the ad, and it is primarily based on expected positive and negative feedback. The factors are:

  • Audience Definition
  • Ad relevance and Freshness
  • Expected Feedback
  • Campaign Objective

Negative feedback has about 10x the impact of positive feedback. Facebook will retire relevance scores on April 30th, replacing them with:

  • Quality ranking - measures an ad’s perceived quality compared to ads competing for the same target audience (optimal incentive will matter here)
  • Engagement rate ranking - an ad’s expected engagement rate compared to ads competing for the same audience
  • Conversion rate ranking - an ad’s expected conversion rate compared to ads with the same optimization goals and audience

These metrics will depend heavily on the quality of the ad visuals and message.

1st Simulation

Facebook calculates a relevance score after 500 impressions, so we will begin with 500 random observations.

theory <- tibble(relevance = round(runif(500, min = 0, max = 10), 1),
                 cost = (0))
theory <- theory %>%
  mutate(cost = ifelse(relevance >=8, abs(rnorm(length(relevance), 11-relevance, 2)), 
                       ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), 11-relevance, 2.5))+1, abs(rnorm(length(relevance), 11-relevance, 2)+2))),
         quality = ifelse(relevance >=8, abs(rnorm(length(relevance), relevance, 1.8)), 
                          ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), relevance, 1.5)), abs(rnorm(length(relevance), relevance, 1.2)))),
         group =   ifelse(relevance >= 8, "1",
                          ifelse(relevance < 8 & relevance >4, "2", "3")),
         quality = ifelse(quality >10, 10, quality))
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(theory, x = ~relevance, y = ~quality,  z = ~cost, color = ~group) %>%
  add_markers()

The model that I have in mind looks like this:

  1. The higher the relevance, the lower the ad cost and the greater the quality of the respondent.
  2. A middle range of relevance will have the most variation, since some respondents may be just as interested in the survey as those whom Facebook determined to be more relevant.
  3. A low relevance score should not vary much; we expect a high ad cost and disinterest in the survey.
  4. The groups are colored according to the relevance score.
xexp <- log(10, base = exp(1.3))
customGreen0 = "#DeF7E9"
customGreen = "#71CA97"
customRed = "#ff7f7f"
Simulation <- function(days = 30, 
                       daily_impressions=2000, 
                       mean_relevance = 7.2, 
                       sd_relevance = .8,
                       relevance_filter = 7,
                       probability_sd = 0.024,
                       cost_sd_rh = 1.75,
                       cost_sd_rm = 2,
                       cost_sd_rl = 2,
                       dollar_incentive_min = 2,
                       dollar_incentive_max = 20,
                       quality_reduction_slope = -1/25,
                       quality_reduction_intercept = 1.4
                       ){

Simulations <- list()

for(i in 1:days){
  
  t_sim <- tibble(relevance = round(rnorm(daily_impressions, mean = mean_relevance, sd = sd_relevance), 1),
                      cost = ifelse(relevance >=8, abs(rnorm(length(relevance), 11-relevance, cost_sd_rh)), 
                                    ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), 11-relevance, cost_sd_rm))+1, abs(rnorm(length(relevance), 11-relevance, cost_sd_rl)+2))),
                      quality = ifelse(relevance >=8, abs(rnorm(length(relevance), relevance, 1.8)), 
                                       ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), relevance, 1.5)), abs(rnorm(length(relevance), relevance, 1.2)))),
                      group =  ifelse(relevance >= 8, "1",
                                      ifelse(relevance < 8 & relevance >4, "2", "3"))) %>%
                filter(relevance >= relevance_filter)
  
  Simulations[[i]] <- t_sim %>%

    mutate(base_prob = (relevance - relevance_filter)/100,  # baseline survey probability rises with relevance
           prob_variation = abs(round(rnorm(nrow(t_sim), base_prob, probability_sd),3)),
           prob_variation = ifelse(prob_variation <= 0, round(runif(nrow(t_sim), .001, .009),3), prob_variation),
           incentive = round(runif(nrow(t_sim), dollar_incentive_min, dollar_incentive_max)),
           inc_prob = prob_variation * log(incentive, base = exp(xexp)),  # apply the logarithmic incentive multiplier
           survey = rbinom(nrow(t_sim), 1, inc_prob),                     # Bernoulli draw: survey completed or not
           qualdecr = quality_reduction_slope*incentive + quality_reduction_intercept,  # quality penalty from incentive
           quality_var = quality * qualdecr) %>%
    select(relevance, cost, quality_var, incentive, inc_prob, survey) %>%
    mutate(Ad_Cost = round(cost, 3),
           Response_Quality = round(quality_var, 3),
           Survey_Probability = round(inc_prob*100, 3),
           Dollar_Incentive = incentive) %>%
    rename(FB_Relevance_Score = relevance, Survey_Completed = survey) %>%
    select(FB_Relevance_Score, Ad_Cost, Dollar_Incentive, Response_Quality, Survey_Probability, Survey_Completed) %>%
    arrange(desc(FB_Relevance_Score))
}
return(do.call(bind_rows, list(.id = "Day", Simulations)))
}

Simulation_Summary <- function(sim, by = "all"){
byday <- sim %>%
  mutate(Day = as.numeric(Day)) %>%
  group_by(Day) %>%
  summarize(Total_Impressions = n(),
            Surveys_Completed = sum(Survey_Completed),
            Mean_Probability = round(mean(Survey_Probability),2),
            Mean_Incentive = round(mean(ifelse(Survey_Completed==1, Dollar_Incentive, NA), na.rm = TRUE),2),
            Total_Incentive_Cost = sum(ifelse(Survey_Completed==1, Dollar_Incentive, 0)),
            Mean_Relevance = round(mean(FB_Relevance_Score),2),
            Mean_Response_Quality = round(mean(Response_Quality),2)) %>%
  arrange(Day) %>%
  mutate(Day = as.character(Day))

total <- sim %>%
  summarize(Day = paste0(nrow(byday), " Days"),
            Total_Impressions = n(),
            Surveys_Completed = sum(Survey_Completed),
            Mean_Probability = round(mean(Survey_Probability),2),
            Mean_Incentive = round(mean(ifelse(Survey_Completed==1, Dollar_Incentive, NA), na.rm = TRUE),2),
            Total_Incentive_Cost = sum(ifelse(Survey_Completed==1, Dollar_Incentive, 0)),
            Mean_Relevance = round(mean(FB_Relevance_Score),2),
            Mean_Response_Quality = round(mean(Response_Quality),2))


if(by == "all"){
  bind_rows(total, byday)
} else if(by == "total"){
  total
} else if(by == "day"){
  byday
} else if(by == "csurvey"){  
  sim %>%
  filter(Survey_Completed==1)
} else { 
  sim
} %>% 
  formattable(align = c("c", "c", "c", "c","c","c","c"),
              list('FB_Relevance_Score' = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
                       'Ad_Cost' = color_tile(customGreen0, customGreen),
                       'Dollar_Incentive' = color_tile(customGreen0, customGreen),
                       'Response_Quality' = color_tile(customGreen0, customGreen),
                       'Survey_Probability' = color_tile(customGreen0, customGreen)))
}

The actual simulation that we’ll be using accounts for much greater variability, as well as interactions between the different variables.

theory <- Simulation(days = 1, 
                       daily_impressions=500, 
                       mean_relevance = 5, 
                       sd_relevance = 2,
                       relevance_filter = 0,
                       probability_sd = 0.024,
                       cost_sd_rh = 1.75,
                       cost_sd_rm = 2,
                       cost_sd_rl = 2,
                       dollar_incentive_min = 2,
                       dollar_incentive_max = 20,
                       quality_reduction_slope = -1/25,
                       quality_reduction_intercept = 1.4
                       ) 
#Output will be FB_Relevance_Score, Ad_Cost, Dollar_Incentive, Response_Quality, Survey_Probability, Survey_Completed.

library(plotly)

plot_ly(theory, x = ~FB_Relevance_Score, y = ~Response_Quality,  z = ~Ad_Cost, color = ~Survey_Probability, type = 'scatter3d', mode = 'markers', marker = list(size = 3))

The 1st theoretical model is far more simplistic, but we expect behavior to resemble the latter model. The dots are colored by the probability that the respondent will complete the survey.

Facebook is unlikely to show the ad if the relevance score is low, so we expect only ads with a relevance score greater than 6 to be shown.

To get a better understanding of the relationship between quality and relevance score, we look at the following plot.

ggplot(theory, aes(x = FB_Relevance_Score, y = Response_Quality, color = Survey_Probability)) + geom_point()

We see that there are fewer low quality respondents when relevance score is high. We expect low relevance to yield low quality responses.

KAG3.1 <- KAG %>%
  filter(xyz_campaign_id == 3) %>%
  mutate(CPM =(Clicks/Impressions)*1000,  # clicks per 1,000 impressions (used below as a relevance proxy)
         SPM =(Spent/Impressions)*1000,   # spend per 1,000 impressions
         cpc = Spent/Clicks,              # cost per click
         apc = Approved_Conversion/Clicks) %>%  # approved conversions per click
  mutate(relevance_approx = ifelse(CPM < 39, "r7",
                     ifelse(CPM > 39 & CPM < 83, "r8", "r9up")))

histo <- KAG3.1 %>%
  ggplot(aes(x=CPM, color = relevance_approx)) +
  geom_histogram()

ggplotly(histo)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We need to determine the impact that Relevance has on Click Through Rate and Impressions, so we look at the previous histogram.

To determine how relevance score is associated with the likelihood of completing a survey, I used the previously mentioned Kaggle data. Relevance score was not included, so I used clicks per 1,000 impressions as a measure of interest in the ad. The histogram shows how CPM will be used to approximate relevance. The higher the CPM, the higher the likelihood that the user will take the survey.

I wanted to use Cost per 1,000 Impressions, but those data are too heavily skewed to the left. Because the data only show conversions, Cost per 1,000 Impressions will be much lower for observations with low impressions.

library(formattable)
## Warning: package 'formattable' was built under R version 3.5.3
## 
## Attaching package: 'formattable'
## The following object is masked from 'package:plotly':
## 
##     style
KAG3.1 %>%
  select(relevance_approx, cpc, apc, SPM) %>%
  group_by(relevance_approx) %>%
  summarize(Mean_APC = mean(apc),
            SD_APC = sd(apc),
            Mean_CPM = mean(SPM),
            total_observations = n()) %>%
  formattable()
relevance_approx   Mean_APC     SD_APC       Mean_CPM   total_observations
r7                 0.22330038   0.28503202   506.4588   314
r8                 0.04472960   0.02482618   711.9548   159
r9up               0.03184836   0.02429892   757.2053   152

The mean approved conversions per click (APC) show that the higher the CPM group, the lower the mean APC. This is an artifact of the data, since only ads with clicks appear, which favors ads with low impressions. Between r8 and r9up the means differ by only about .01 and the standard deviations are very similar, despite the many observations. I believe the mean APC will be .03 for r9up, .2 for r8, and .1 for r7, with a standard deviation of .024.

We also see that these ad campaigns cost between 5.06 and 7.57 dollars per 1,000 impressions.

Based on a past study, a $10 incentive increased the likelihood of a person taking a survey. I also read that incentives may have a ceiling: as the incentive increases, the probability of a person taking the survey also increases, but at a diminishing rate. This appears to follow a logarithmic function.

log(10, base = exp(1.3))
## [1] 1.771219

This logarithm uses base e^{1.3} rather than the standard base e to meet this assumption.

xexp <- log(10, base = exp(1.3))
log(10, base = exp(xexp))
## [1] 1.3
incentives <- seq(1,20,1)

incentbydollar <- tibble(incentive = log(incentives, base=exp(xexp)),
              dollars = incentives,
              qualdecr = (-1/25)*incentives+1.4)


ggplot(incentbydollar, aes(x=dollars, y = incentive)) + 
  geom_point() + 
  geom_abline(intercept = (1-1/xexp), slope = 1/(exp(xexp)*xexp)) + 
  geom_hline(yintercept = 1, linetype = "dashed", color = "blue", size = .5) +
  geom_vline(xintercept = exp(xexp), linetype = "dashed", color = "blue", size = .5) + 
  geom_point(aes(x=dollars, y = qualdecr, color = "red"))

This model says that from 0 to 6 dollars, the probability of taking the survey is reduced (the multiplier is below 1), but the multiplier climbs quickly with every dollar given. At 6 dollars, the probability is multiplied by 1. Every dollar after this increases the probability further, but at a diminishing rate. At 10 dollars, the probability increases by 30%, as the earlier study concluded. The tangent line shows the rate of increase at the point where the incentive multiplier equals 1.
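The tangent line’s slope comes from differentiating the multiplier f(x) = log(x, base = e^xexp) = ln(x)/xexp, giving f'(x) = 1/(x * xexp); at x = e^xexp (about 6 dollars) this is 1/(e^xexp * xexp), the slope used in the geom_abline call above. A quick numerical check:

```r
xexp <- log(10, base = exp(1.3))    # ~1.7712, as defined earlier
f <- function(x) log(x, base = exp(xexp))  # the incentive multiplier
x0 <- exp(xexp)                     # dollar amount where the multiplier equals 1
analytic <- 1 / (exp(xexp) * xexp)  # the slope passed to geom_abline
numeric  <- (f(x0 + 1e-6) - f(x0 - 1e-6)) / 2e-6  # central-difference estimate
c(f(x0), f(10))                     # multiplier is 1 at ~$6 and 1.3 at $10
abs(analytic - numeric) < 1e-6      # TRUE: the tangent slope matches
```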

I also read that as the incentive increases, the quality of the survey responses decreases. A greater incentive attracts respondents who are less motivated by the survey and more by the money; it can also create a feeling of “being bribed”. This means that a greater incentive should negatively impact quality. The red dots demonstrate this multiplier, which will be applied against quality.

The scalars are applied, and completion of the survey is randomly generated through the binomial distribution (probability and standard deviation based on the Kaggle data).

relv <- theory %>% filter(FB_Relevance_Score >= 7)
plot_ly(relv, x = ~FB_Relevance_Score, y = ~Response_Quality,  z = ~Dollar_Incentive, color = ~Survey_Probability, symbol = ~Survey_Completed, symbols = c("o", "circle"), type = 'scatter3d', mode = 'markers', marker = list(size = 5))

The plot marks respondents who completed a survey with a filled dot. They are plotted by their relevance score, the expected quality of their responses, and the amount of their incentive. The color shows the probability that they would complete the survey; the lighter the color, the higher the probability.

relv %>% Simulation_Summary(by="csurvey")
## # A tibble: 12 x 7
##    Day   FB_Relevance_Sc~ Ad_Cost Dollar_Incentive Response_Quality
##    <chr>            <dbl>   <dbl>            <dbl>            <dbl>
##  1 1                  9.1    1.46               17             8.43
##  2 1                  8.6    3.23               17             6.25
##  3 1                  8.5    1.85               17             6.92
##  4 1                  8      2.98                2             8.76
##  5 1                  7.9    2.19                8             6.08
##  6 1                  7.9    6.39               14             7.57
##  7 1                  7.4    1.45                8             8.25
##  8 1                  7.3    6.02               15             6.96
##  9 1                  7.2    6.50               15             6.52
## 10 1                  7.2    9.05                7             8.10
## 11 1                  7.2    1.45                3             7.16
## 12 1                  7.1    2.54               14             6.42
## # ... with 2 more variables: Survey_Probability <dbl>,
## #   Survey_Completed <int>

This is the summary of this sample. The results are random at every iteration, but we should expect ad costs to be lower, dollar incentives somewhere above 10, response quality varied, and survey probabilities higher when relevance scores are high.

Simulating a Day

Now that we and Facebook know how our ads work with certain groups, we can target those groups better. This simulation will be better targeted, with 2,000 impressions per day and a normally distributed relevance score with mean 7.2 and sd 0.8.

Day_sim <- Simulation(days = 1, 
                       daily_impressions=2000, 
                       mean_relevance = 7.2, 
                       sd_relevance = .8,
                       relevance_filter = 6,
                       probability_sd = 0.024,
                       cost_sd_rh = 1.75,
                       cost_sd_rm = 2,
                       cost_sd_rl = 2,
                       dollar_incentive_min = 2,
                       dollar_incentive_max = 20,
                       quality_reduction_slope = -1/25,
                       quality_reduction_intercept = 1.4
                       ) 
plot_ly(Day_sim, x = ~FB_Relevance_Score, y = ~Response_Quality,  z = ~Dollar_Incentive, color = ~Survey_Probability, symbol = ~Survey_Completed, symbols = c("o", "circle"), type = 'scatter3d', mode = 'markers', marker = list(size = 3))
Day_summary = Day_sim %>%
  filter(Survey_Completed==1) %>%
  summarize(FB_Relevance_Score=mean(FB_Relevance_Score),
            Ad_Cost=mean(Ad_Cost),
            Dollar_Incentive=mean(Dollar_Incentive),
            Response_Quality=mean(Response_Quality),
            Survey_Probability=mean(Survey_Probability),
            Surveys_Completed = sum(Survey_Completed))

Day_summary %>%
formattable(align = c("c", "c", "c","c","c"),
              list('FB_Relevance_Score' = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
                       'Ad_Cost' = color_tile(customGreen0, customGreen),
                       'Dollar_Incentive' = color_tile(customGreen0, customGreen),
                       'Response_Quality' = color_tile(customGreen0, customGreen),
                       'Survey_Probability' = color_tile(customGreen0, customGreen)))
FB_Relevance_Score   Ad_Cost   Dollar_Incentive   Response_Quality   Survey_Probability   Surveys_Completed
7.298                4.82168   11.42              6.8581             4.6529               50
Simulation_Summary(Day_sim, by = "csurvey")
## # A tibble: 50 x 7
##    Day   FB_Relevance_Sc~ Ad_Cost Dollar_Incentive Response_Quality
##    <chr>            <dbl>   <dbl>            <dbl>            <dbl>
##  1 1                  9.5   0.173                7             8.52
##  2 1                  8.9   0.894               14             8.51
##  3 1                  8.6   2.50                16             8.76
##  4 1                  8.5   3.97                16             6.40
##  5 1                  8.2   2.77                 6             9.15
##  6 1                  8     0.918               12             5.51
##  7 1                  7.9   2.94                17             3.91
##  8 1                  7.8   5.20                 8             7.88
##  9 1                  7.8   7.05                14             4.08
## 10 1                  7.7   5.07                13             3.52
## # ... with 40 more rows, and 2 more variables: Survey_Probability <dbl>,
## #   Survey_Completed <int>

Simulating a Month

Month_sim <- Simulation(days = 30, 
                       daily_impressions=2000, 
                       mean_relevance = 7.2, 
                       sd_relevance = .8,
                       relevance_filter = 6,
                       probability_sd = 0.024,
                       cost_sd_rh = 1.75,
                       cost_sd_rm = 2,
                       cost_sd_rl = 2,
                       dollar_incentive_min = 2,
                       dollar_incentive_max = 20,
                       quality_reduction_slope = -1/25,
                       quality_reduction_intercept = 1.4
                       ) 
table(Month_sim$Survey_Completed)
## 
##     0     1 
## 54878  1588
Month_sim %>%
  ggplot(aes(x=Dollar_Incentive)) +
  geom_histogram(bins=19, color = "white") + 
  facet_grid(Survey_Completed~., scales = "free")

Month_sim %>%
  ggplot(aes(x=Response_Quality)) +
  geom_histogram(bins=19, color = "white") + 
  facet_grid(Survey_Completed~., scales = "free")

Month_sim %>%
  ggplot(aes(x= FB_Relevance_Score)) +
  geom_histogram(bins=10, color = "white") + 
  facet_grid(Survey_Completed~., scales = "free")

Month_sim %>%
  ggplot(aes(x= Ad_Cost)) +
  geom_histogram(bins=20, color = "white") + 
  facet_grid(Survey_Completed~., scales = "free")

Month_sim %>%
  select(-Day, -Survey_Completed) %>%
cor() %>% 
  formattable()
##                    FB_Relevance_Score Ad_Cost   Dollar_Incentive
## FB_Relevance_Score 1                  -0.4536   0.001087        
## Ad_Cost            -0.4536            1         -0.003953       
## Dollar_Incentive   0.001087           -0.003953 1               
## Response_Quality   0.302              -0.1361   -0.6709         
## Survey_Probability 0.1594             -0.07859  0.3188          
##                    Response_Quality Survey_Probability
## FB_Relevance_Score 0.302            0.1594            
## Ad_Cost            -0.1361          -0.07859          
## Dollar_Incentive   -0.6709          0.3188            
## Response_Quality   1                -0.1679           
## Survey_Probability -0.1679          1

We see the counts of surveys not completed versus completed (54,878 and 1,588) and several histograms comparing respondents who completed the survey against those who did not. The first is the most notable: as the incentive increased, so did the number of respondents. The histogram heights are not on a common scale, so note the counts on the left side. The correlation matrix shows that none of the variables are strongly related, save for the negative association between Incentive and Quality.

library(ggridges)
## Warning: package 'ggridges' was built under R version 3.5.3
## 
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
## 
##     scale_discrete_manual
Month_sim <- Simulation()
Month_sim %>%
  filter(Survey_Completed==1) %>%
  select(Dollar_Incentive, Response_Quality) %>%
  arrange(Dollar_Incentive, Response_Quality) %>% 
  mutate(Dollar_Incentive = as.factor(Dollar_Incentive)) %>%
  ggplot(aes(x = Response_Quality, y = Dollar_Incentive, height = ..density..)) + 
  geom_density_ridges_gradient(aes(fill = ..x..)) +
  scale_fill_gradientn(
    colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
    name = "Quality"
  )+
  labs(title = 'Response Quality vs Incentive')
## Picking joint bandwidth of 0.639

Here, we get a better understanding of the relationship between Incentive and Quality: the simulation models these two with a negative linear relationship.
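
This negative relationship is consistent with the quality_reduction_slope = -1/25 and quality_reduction_intercept = 1.4 parameters passed to Simulation() above. A minimal sketch of how such a linear quality factor behaves (an assumption about the simulation internals, not the actual implementation):

```r
# Sketch (assumed mechanism): a linear quality factor built from the
# quality_reduction_* parameters used in the Simulation() call above.
quality_factor <- function(incentive, slope = -1/25, intercept = 1.4) {
  intercept + slope * incentive
}
quality_factor(2)    # -> 1.32 near the minimum incentive
quality_factor(20)   # -> 0.6 at the maximum incentive
```

Under these parameters, quality declines steadily as the incentive rises, matching the ridgeline plot.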

Summary of the Simulation

summary_Month_sim <- Simulation_Summary(Month_sim)
summary_Month_sim
## # A tibble: 31 x 8
##    Day   Total_Impressio~ Surveys_Complet~ Mean_Probability Mean_Incentive
##    <chr>            <int>            <int>            <dbl>          <dbl>
##  1 30 D~            37375              979             2.58           12.7
##  2 1                 1205               37             2.62           11.9
##  3 2                 1213               29             2.55           13.9
##  4 3                 1217               32             2.56           12.6
##  5 4                 1222               31             2.66           14.0
##  6 5                 1245               33             2.57           12.5
##  7 6                 1244               33             2.6            12.8
##  8 7                 1261               30             2.63           12.3
##  9 8                 1284               32             2.7            12.7
## 10 9                 1261               27             2.53           11.5
## # ... with 21 more rows, and 3 more variables: Total_Incentive_Cost <dbl>,
## #   Mean_Relevance <dbl>, Mean_Response_Quality <dbl>

Now we see that 979 surveys were completed at an incentive cost of about $12,426 (1.2426 x 10^4). We must find the optimal incentive, the one yielding the highest probability of completing a survey at the lowest cost.
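
As a quick sanity check on the reported totals, the incentive cost should be roughly the number of completed surveys times the mean incentive from the summary row:

```r
# Rough check: total incentive cost ~ completed surveys x mean incentive,
# using the "30 D~" summary row above.
surveys_completed <- 979
mean_incentive <- 12.7
surveys_completed * mean_incentive
```

This lands near 12,433, close to the ~$12,426 total (the small gap comes from rounding the mean incentive to one decimal).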

Month_sim %>%
  group_by(Dollar_Incentive) %>%
  summarize(n_impressions = n(),
            n_surveys = sum(Survey_Completed),
            survey_cost = sum(ifelse(Survey_Completed==1,Dollar_Incentive,0)),
            ad_cost = sum(Ad_Cost/100),
            mean_prob = mean(Survey_Probability),
            sd_prob = sd(Survey_Probability)) %>%
  arrange(desc(Dollar_Incentive))
## # A tibble: 19 x 7
##    Dollar_Incentive n_impressions n_surveys survey_cost ad_cost mean_prob
##               <dbl>         <int>     <int>       <dbl>   <dbl>     <dbl>
##  1               20          1075        39         780    44.5     3.41 
##  2               19          2104        75        1425    86.3     3.42 
##  3               18          2020        77        1386    83.8     3.34 
##  4               17          2118        80        1360    86.8     3.16 
##  5               16          2040        60         960    83.3     3.18 
##  6               15          2156        79        1185    88.6     2.98 
##  7               14          2061        72        1008    84.6     3.04 
##  8               13          2128        62         806    88.2     2.92 
##  9               12          2046        63         756    85.0     2.86 
## 10               11          2060        40         440    83.7     2.71 
## 11               10          2099        52         520    87.3     2.66 
## 12                9          2055        49         441    83.5     2.56 
## 13                8          2110        55         440    85.5     2.40 
## 14                7          1988        36         252    83.3     2.24 
## 15                6          2118        48         288    86.6     2.10 
## 16                5          2072        36         180    84.8     1.85 
## 17                4          2094        36         144    87.4     1.61 
## 18                3          2034        15          45    84.4     1.25 
## 19                2           997         5          10    40.4     0.772
## # ... with 1 more variable: sd_prob <dbl>

This gives us a better look at how expensive these incentives are, how they relate to Ad Cost, and the rate at which the surveys were completed.
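
One way to compare incentive levels in the table above is the all-in cost per completed survey. The helper below is hypothetical (not part of the simulation code), using the survey_cost, ad_cost, and n_surveys columns:

```r
# Hypothetical helper: total (incentive + ad) cost per completed survey.
cost_per_survey <- function(survey_cost, ad_cost, n_surveys) {
  (survey_cost + ad_cost) / n_surveys
}
cost_per_survey(780, 44.5, 39)   # $20 incentive row: ~21.1 per survey
cost_per_survey(180, 84.8, 36)   # $5 incentive row:  ~7.4 per survey
```

Higher incentives buy more completions but at a sharply rising per-survey cost, which is the trade-off the optimization below tries to balance.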

Month_sim %>%
  select(Dollar_Incentive, Survey_Probability) %>%
  arrange(Dollar_Incentive, Survey_Probability) %>% 
  mutate(Dollar_Incentive = as.factor(Dollar_Incentive)) %>%
  ggplot(aes(x = Survey_Probability, y = Dollar_Incentive, height = ..density..)) + 
  geom_density_ridges_gradient(aes(fill = ..x..)) +
  scale_fill_gradientn(
    colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
    name = "Probability"
  )+
  labs(title = 'Survey Probability vs Incentive')
## Picking joint bandwidth of 0.378

Here, we see that as the Incentive increases, so does the probability that a respondent will complete a survey.

NOTE: the Dollar Incentive could be expressed relative to a respondent group’s median hourly income, with the scale given as a z-score from that value. The simulation could then represent all groups, assuming that each group behaves similarly.
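
The rescaling suggested in this note could be sketched as follows (the income figures here are made-up illustration values):

```r
# Sketch: express a dollar incentive as a z-score relative to a group's
# median hourly income (median_income and sd_income are hypothetical).
incentive_z <- function(incentive, median_income, sd_income) {
  (incentive - median_income) / sd_income
}
incentive_z(15, median_income = 12, sd_income = 5)   # -> 0.6
```

On this scale, the same z-score would correspond to different dollar amounts for different groups, letting one simulation stand in for all of them.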

Finding the Optimal Incentive

The optimal incentive is the lowest incentive point that still maximizes the probability of taking the survey.

This first optimization uses the parameters without the randomness included.

incentive = seq(2,20,1)
inc_func_theoretical <- function(incentive) {
  mean_prob = seq(.001,.03, .001)
  # NOTE: xexp is assumed to be defined earlier in the document
  inc_by_prob = (mean_prob * log(incentive, base = exp(xexp)))
  return(max(inc_by_prob))
}
plot(optim(incentive, inc_func_theoretical)$par)

This suggests that 16 is the lowest incentive with the highest rate of survey completion.

mean_probs_emp <- Month_sim %>%
  group_by(Dollar_Incentive) %>%
  summarize(mean_prob = mean(Survey_Probability)) %>%
  arrange((Dollar_Incentive)) %>%
  select(mean_prob)

inc_func_emp <- function(incentive) {
  inc_by_prob = (mean_probs_emp * log(incentive, base = exp(xexp)))
  return(max(inc_by_prob))
}
inc_prob <- sapply(incentive, inc_func_emp)
plot(incentive, optim(incentive, inc_func_emp)$par)

However, the empirical distribution suggests that 15 is the optimal price point.

incentive = seq(2,20,1)

incentive_b_dollar <- tibble(Probability_of_Survey =inc_prob,
              Incentive = incentive)

optimal_price <- (which.max(optim(incentive, inc_func_emp)$par)+1)

ggplot(incentive_b_dollar, aes(x=Incentive, y = Probability_of_Survey)) + 
  geom_point() +
  geom_hline(yintercept = incentive_b_dollar$Probability_of_Survey[optimal_price-1], linetype = "dashed", color = "red", size = .5) +
  geom_vline(xintercept = (optimal_price), linetype = "dashed", color = "red", size = .5)+
  geom_hline(yintercept = incentive_b_dollar$Probability_of_Survey[which(incentive_b_dollar[,"Incentive"]==10)], linetype = "dashed", color = "blue", size = .5) +
  geom_vline(xintercept = 10, linetype = "dashed", color = "blue", size = .5)

The red crosshairs mark the minimum incentive at which the probability of taking a survey is maximized. The blue crosshairs mark where we believed the optimal point would be.

OP_Sim <- Simulation(dollar_incentive_min = optimal_price, 
                     dollar_incentive_max = optimal_price)

Simulation_Summary(OP_Sim, by = "total")
## # A tibble: 1 x 8
##   Day   Total_Impressio~ Surveys_Complet~ Mean_Probability Mean_Incentive
##   <chr>            <int>            <int>            <dbl>          <dbl>
## 1 30 D~            37651             1187             3.13             15
## # ... with 3 more variables: Total_Incentive_Cost <dbl>,
## #   Mean_Relevance <dbl>, Mean_Response_Quality <dbl>
# SIMULATION PARAMETERS
#          Simulation(days = 30,
#                       daily_impressions=2000, 
#                       mean_relevance = 7.2, 
#                       sd_relevance = .8,
#                       relevance_filter = 7,
#                       probability_sd = 0.024,
#                       cost_sd_rh = 1.75,
#                       cost_sd_rm = 2,
#                       cost_sd_rl = 2,
#                       dollar_incentive_min = 2,
#                       dollar_incentive_max = 20,
#                       quality_reduction_slope = -1/25,
#                       quality_reduction_intercept = 1.4
#                       )

month <- tibble()
Year <- list()
number_simulations <- 100
Optimal_Incentives <- as.numeric(vector(length = number_simulations))
for(i in 1:number_simulations){
month <- Simulation()

mean_probs_emp <- month %>%
  group_by(Dollar_Incentive) %>%
  summarize(mean_prob = mean(Survey_Probability)) %>%
  arrange((Dollar_Incentive)) %>%
  select(mean_prob)

Year[[i]] <- month %>% Simulation_Summary("total")

inc_func_emp <- function(incentive) {
  inc_by_prob = (mean_probs_emp * log(incentive, base = exp(xexp)))
  return(max(inc_by_prob))
}
Optimal_Incentives[i] <- which.max(optim(incentive, inc_func_emp)$par)+1
}
OPI <- as.data.frame(table(Optimal_Incentives))
opt <- OPI[which.max(OPI$Freq),"Optimal_Incentives"]
opt <- as.numeric(as.character(opt))
ggplot(as.data.frame(Optimal_Incentives), aes(x=Optimal_Incentives)) +  
  geom_histogram(aes(y=..density..), colour="black", fill="white", bins = nrow(OPI))+
  geom_density(alpha=.2, fill="#FF6666") +
  geom_vline(xintercept = opt, color = "blue")

The maximum likelihood optimal point is 13.
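
The "maximum likelihood" point here is simply the mode of the optimal incentives across the 100 simulated months, mirroring the which.max(OPI$Freq) step above:

```r
# The most frequent optimum across simulation runs (the mode of a
# vector of per-run optimal incentives; input values are illustrative).
mode_incentive <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}
mode_incentive(c(13, 13, 14, 12, 13, 15))   # -> 13
```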